Skip to content

[2/7] Telemetry Infrastructure: CircuitBreaker and FeatureFlagCache#325

Merged
samikshya-db merged 27 commits into
mainfrom
telemetry-2-infrastructure
Apr 21, 2026
Merged

[2/7] Telemetry Infrastructure: CircuitBreaker and FeatureFlagCache#325
samikshya-db merged 27 commits into
mainfrom
telemetry-2-infrastructure

Conversation

@samikshya-db
Copy link
Copy Markdown
Collaborator

Part 2 of 7-part Telemetry Implementation Stack

This layer adds infrastructure components for resilience and feature flag management.

Summary

Implements CircuitBreaker for per-host endpoint protection and FeatureFlagCache for efficient feature flag management with reference counting.

Components

CircuitBreaker (lib/telemetry/CircuitBreaker.ts)

State machine for endpoint protection:

  • CLOSED (normal operation) → OPEN (failing) after threshold failures
  • OPENHALF_OPEN (testing) after timeout period
  • HALF_OPENCLOSED (recovered) after consecutive successes
  • HALF_OPENOPEN (still failing) on any failure

Configuration:

  • Failure threshold: 5 (default)
  • Timeout: 60s (default)
  • Success threshold: 2 (default)

CircuitBreakerRegistry

Manages circuit breakers per host for isolation:

  • Each host gets independent circuit breaker
  • Prevents cascade failures across hosts
  • Thread-safe per-host state management

FeatureFlagCache (lib/telemetry/FeatureFlagCache.ts)

Per-host feature flag caching with lifecycle management:

  • 15-minute TTL with automatic refresh
  • Reference counting tracks connection lifecycle
  • Context removed when refCount reaches zero
  • Reduces API calls for feature flag checks

Testing

  • 32 unit tests for CircuitBreaker (100% function coverage)
  • 29 unit tests for FeatureFlagCache (100% function coverage)
  • CircuitBreakerStub for testing dependent components
  • All state transitions and edge cases covered

Next Steps

This PR is followed by:

  • [3/7] Client Management: TelemetryClient and Provider
  • [4/7] Event & Aggregation: EventEmitter and MetricsAggregator
  • [5/7] Export: DatabricksTelemetryExporter
  • [6/7] Integration: Wire into DBSQLClient
  • [7/7] Testing & Documentation

Dependencies

Depends on: [1/7] Telemetry Foundation (#324)

samikshya-db and others added 5 commits January 29, 2026 20:20
This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier
Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling
- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing
- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils
Fix TypeScript compilation error by implementing getAuthHeaders
method required by IClientContext interface.
Added osArch, runtimeVendor, localeName, charSetEncoding, and
processName fields to DriverConfiguration to match JDBC implementation.
Comment thread lib/telemetry/CircuitBreaker.ts
const authHeaders = authenticatedExport ? await this.context.getAuthHeaders() : {};

// Make HTTP POST request with authentication
const response: Response = await this.fetchFn(endpoint, {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this support proxy?

Copy link
Copy Markdown
Collaborator Author

@samikshya-db samikshya-db Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point! This is a miss on the PR : included this now. [I have merged this as part of PR #330 : as it had other changes around calling telemetry endpoint too]

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Both the exporter and FeatureFlagCache route HTTP through connectionProvider.getAgent(), which picks up proxy config from ConnectionOptions.proxy via HttpConnection (same plumbing CloudFetchResultHandler and DBSQLSession use). See sendRequest/fetchWithRetry in both files.

Comment thread lib/telemetry/TelemetryEventEmitter.ts
Comment thread lib/telemetry/DatabricksTelemetryExporter.ts Outdated
Comment thread lib/telemetry/DatabricksTelemetryExporter.ts Outdated
char_set_encoding?: string;
process_name?: string;
};
driver_connection_params?: any;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F19] driver_connection_params?: any — dead scaffolding (Severity: Medium)

Declared on DatabricksTelemetryLog but never populated anywhere (grep of toTelemetryLog and producers confirms). An untyped any slot with no producer is exactly the kind of "we'll fill it in later" shape that decays.

Suggested fix: Delete the field. Re-add it alongside its producer when connection-parameter shipping actually lands, and give it a concrete type at that point.

Posted by code-review-squad • flagged by maintainability, language.

: buildTelemetryUrl(this.host, '/telemetry-unauth');

// Format payload - each log is JSON-stringified to match JDBC format
const telemetryLogs = metrics.map((m) => this.toTelemetryLog(m));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[F20] Double JSON.stringify inflates body and CPU (Severity: Medium)

Each log is stringified (logs.map(log => JSON.stringify(log))), then embedded as string elements in the outer payload which is re-stringified — every " becomes \". At large batch sizes this is measurable CPU and body-size overhead, and it roughly doubles the wire bytes for a payload that is already mostly JSON escape sequences.

Suggested fix: Build the body by hand:

const body = `{"uploadTime":${Date.now()},"items":[],"protoLogs":[${logs.map(l => JSON.stringify(l)).join(',')}]}`;

Or emit protoLogs as a JSON array directly and have the server parse it as such.

Posted by code-review-squad • flagged by performance.

@vikrantpuppala
Copy link
Copy Markdown
Collaborator

Code Review Squad Report

Merge Safety Score: 54/100 — HIGH RISK

9 parallel reviewers (security, architecture, language, ops-safety, performance, tests, maintainability, agent-compat, devil's advocate) reviewed this PR. Summary: good scaffolding and happy-path test coverage, but blocking issues around data loss on shutdown, host-validation-based SSRF, PII leakage, and a concurrency race in the circuit breaker.

Findings posted as inline comments

High severity (9)

  • F1CircuitBreaker.ts:94 — HALF_OPEN concurrent-probe race
  • F2MetricsAggregator.ts:319 — silent final-metrics loss on flush/close
  • F3telemetryUtils.ts:21 — weak host validation (SSRF / bearer exfiltration)
  • F4DatabricksTelemetryExporter.ts:329 — PII leak + stack_trace is actually the error message
  • F5DatabricksTelemetryExporter.ts:147 — retry/backoff duplicates HttpRetryPolicy; bypasses HttpConnection
  • F6FeatureFlagCache.ts:38 — whole class landed with no in-PR consumer
  • F7IClientContext.ts:49getAuthHeaders bloats the contract; HeadersInit leaks node-fetch
  • F8CircuitBreaker.ts:96 — breaker-open identified by string-match; add a typed error
  • F9DatabricksTelemetryExporter.ts:209/telemetry-unauth payload still carries workspace/session/statement IDs

Medium severity (11)

  • F10MetricsAggregator.ts:294 — O(n) slice() on every overflow push
  • F11CircuitBreaker.ts:205 — registry never shrinks; per-host leak
  • F12MetricsAggregator.ts:57statementMetrics / per-statement errors unbounded
  • F13MetricsAggregator.ts:298 — FIFO drops first-failure errors (most valuable signal)
  • F14DatabricksTelemetryExporter.ts:155 — off-by-one: maxRetries=3 → 4 HTTP calls
  • F15DatabricksTelemetryExporter.ts:181 — backoff hardcoded; total can overshoot cap
  • F16DatabricksTelemetryExporter.ts:237 — missing-auth silently counts as breaker success
  • F17types.ts:215processName exports absolute paths (PII)
  • F18tests/unit/telemetry/* — missing concurrent-probe, auth-header merge, thrown-error-retry, real-fetch tests
  • F19DatabricksTelemetryExporter.ts:57 — dead driver_connection_params?: any scaffolding
  • F20DatabricksTelemetryExporter.ts:215 — double JSON.stringify inflates body and CPU
  • F21 — multiple — catch (error: any) crashes on non-Error throws inside "never-throws" code

Low severity (12, not posted inline)

Runtime-name hardcoded (F22); over-exposed test accessors getFailureCount/getSuccessCount (F23); timer reset starves periodic flush (F24); single-function util file (F25); one-line delegate methods (F26); misleading success log (F27); Date vs epoch-ms (F28); unseeded Math.random in retry jitter (F29); telemetry surface not exported from lib/index.ts (F30); ClientConfig.telemetry* keys lack TSDoc + defaults (F31); agent rejectUnauthorized: false widens bearer exposure — repo-wide posture, escalate separately (F32); DatabricksTelemetryLog permits illegal field combinations — use a discriminated union (F33).

Verified & dismissed

  • Double-scheme bug (https://https://...) — false alarm; regex strips it.
  • Proxy support — resolved in THIS PR via connectionProvider.getAgent() in both exporter and FeatureFlagCache. The "deferred to PR [7/7] Telemetry Testing and Documentation - FINAL LAYER #330" note in prior review is now stale.
  • Timer keepalive — correctly handled via flushTimer.unref() at MetricsAggregator.ts:366.
  • Custom CircuitBreaker (not opossum/cockatiel) — accepted per Node >=14 constraint.
  • urlUtils → telemetryUtils rename — done.

Ship recommendation

Address F1–F9 before merge. F10–F21 can land as a follow-up before PR 3/7 wires this into DBSQLClient. F32 (shared-agent rejectUnauthorized: false) is a repo-wide issue that pre-dates this PR but is worth a separate ticket since telemetry now widens bearer exposure.


Feedback? Drop it in #code-review-squad-feedback.

@samikshya-db
Copy link
Copy Markdown
Collaborator Author

Addressing Vikrant's review. Summary of changes in the latest push:

Direct comments

  • CircuitBreaker.ts (jade/vikrant +1): fixed the HALF_OPEN concurrent-probe race — admission is now a single synchronous check-and-set (tryAdmit), and added a test that launches two execute()s concurrently after the OPEN→HALF_OPEN transition and asserts exactly one probe runs.
  • DatabricksTelemetryExporter user-agent: now uses lib/utils/buildUserAgentString.ts instead of constructing ad-hoc; same helper applied in FeatureFlagCache. One place to update.
  • FeatureFlagCache / exporter HTTP client: both now use the same connectionProvider.getAgent() + fetch pattern already used by CloudFetchResultHandler/DBSQLSession, picking up proxy support.
  • urlUtils file: renamed telemetryUtils and now carries three related helpers (URL validation, secret redactor, process-name sanitizer), so it's no longer a single-function file.
  • MetricsAggregator bound: statementMetrics now has a TTL (statementTtlMs, default 1h) and per-statement error buffer is capped (maxErrorsPerStatement, default 50). Overflow on pendingMetrics prefers dropping non-error entries.

F1–F20 (posted via code-review-squad)

  • F1 HALF_OPEN race: fixed (see above); new test.
  • F2 no-data-loss close: flush() is now async and returns the export promise; close() awaits it.
  • F3 host validation: buildTelemetryUrl parses with new URL, rejects non-root path, query, fragment, userinfo, CR/LF. Malformed host throws AuthenticationError so the batch is dropped.
  • F4 stack-trace PII: added errorStack to TelemetryEvent/TelemetryMetric/emitError; redactSensitive strips Bearer/token/password/api_key and caps at 2 KiB.
  • F5 / F14 / F15: retry loop is now config-driven (telemetryBackoffBaseMs/MaxMs/JitterMs), maxRetries matches HttpRetryPolicy semantics (retries after initial), classification goes through ExceptionClassifier. Full reuse of HttpRetryPolicy would need it to model thrown errors (not just response codes) — happy to factor into a shared TelemetryRetryPolicy if preferred.
  • F6: see below.
  • F7 IClientContext bloat: removed getAuthHeaders() from IClientContext. Exporter and FFC now take IAuthentication directly. HeadersInit no longer leaks into the contract.
  • F8 named error: CircuitBreakerOpenError / CIRCUIT_BREAKER_OPEN_CODE exported; call sites use instanceof.
  • F9 unauth correlation: unauthenticated export now scrubs session_id, statement_id, workspace_id and drops raw error detail.
  • F10 quadratic overflow: replaced slice() with a single splice(); log message corrected.
  • F11 registry leak: added DatabricksTelemetryExporter.dispose() which calls registry.removeCircuitBreaker(host); the consumer in a later PR will call this from DBSQLClient.close().
  • F12 unbounded maps: statementTtlMs + maxErrorsPerStatement + eviction in getOrCreateStatementDetails.
  • F13 overflow policy: preserves error metrics over connection/statement.
  • F16 auth-missing as failure: now throws AuthenticationError so the breaker counts it; header presence is checked case-insensitively after normalising all three HeadersInit shapes; warns once per exporter instance on first skip.
  • F17 processName PII: sanitizeProcessName returns the basename; applied at export time.
  • F19 dead field: driver_connection_params?: any removed.
  • F20 double stringify: the outer payload stays stringified (server contract), but inner logs are JSON objects so only one escape layer remains.

F6 — FFC without consumer: this is a 7-PR stack; the consumer lands in [5/7]. To de-risk the "shape drifts before it's exercised" concern, FFC and the exporter now share the HTTP/auth/user-agent plumbing, so any drift would break both at once. Happy to drop FFC from this PR and re-add it with its consumer if you'd rather.

All 145 telemetry unit tests pass, plus the full unit suite (632 passing) — excluding two OAuth test files whose pre-existing [Symbol.asyncDispose] TS errors were there before this branch. ESLint clean.

Security (Critical):
- buildTelemetryUrl now rejects protocol-relative //prefix, zero-width
  and BOM codepoints, CR/LF/tab, userinfo, path/query/fragment, and
  loopback/RFC1918/IMDS/localhost/GCP+Azure-metadata hosts. Defeats
  the SSRF-shaped Bearer-token exfil vector an attacker-influenced
  host (env var, tampered config) could trigger.
- redactSensitive now covers real Databricks secret shapes: dapi/
  dkea/dskea/dsapi/dose PATs, JWT triplets, JSON-quoted access_token/
  client_secret/refresh_token/id_token/password/api_key, Basic auth,
  URL-embedded credentials. Re-applies after truncation.
- Unauthenticated export now omits system_configuration entirely,
  strips userAgentEntry from User-Agent, and blanks stack_trace, so
  on-path observers cannot re-identify clients on the unauth path.
- sanitizeProcessName drops argv tail (handles node --db-password=X
  app.js shape).

Correctness (High):
- MetricsAggregator gained a closed flag; close() no longer races
  with batch-triggered flushes that would resurrect the interval.
- evictExpiredStatements now runs from the periodic flush timer so
  idle clients actually reclaim orphan statement entries.
- Evicted statements emit their buffered error events as standalone
  metrics before being dropped — first-failure signal survives.
- Batch-size and terminal-error flush paths pass resetTimer=false so
  sustained overflow cant starve the periodic tail drain.
- TelemetryTerminalError introduced for host-validation refusals,
  separating that taxonomy from AuthenticationError.
- authMissingWarned re-arms after a successful export so operators
  see a fresh signal the next time auth breaks.
- Retry log denominator uses totalAttempts (not maxRetries); negative
  maxRetries falls back to default; retry log includes the redacted
  failing error so ops can see whats being retried.

API / hygiene:
- CircuitBreakerOpenError, CIRCUIT_BREAKER_OPEN_CODE, and
  TelemetryTerminalError re-exported from lib/index.ts so consumers
  can instanceof-check.
- DBSQLClient.getAuthProvider marked @internal.
- DEFAULT_TELEMETRY_CONFIG / DEFAULT_CIRCUIT_BREAKER_CONFIG frozen.
- pushBoundedError uses if instead of while.
- CIRCUIT_BREAKER_OPEN_CODE typed as const.
- Default export on buildTelemetryUrl removed (no callers).
- Dropped wasted new Error allocation in processErrorEvent.

Tests:
- New telemetryUtils.test.ts (53 tests): URL-rejection table covering
  every known bypass, redactor shapes, sanitize process name.
- DatabricksTelemetryExporter: 13 new tests covering Authorization
  on-the-wire, authMissingWarned idempotency + re-arm, unauth
  correlation/system_configuration/userAgentEntry/stack_trace
  stripping, malformed-host drop, loopback drop, dispose idempotency,
  errorStack to redacted stack_trace flow.
- MetricsAggregator: 2 new tests for async close() awaiting the
  exporter promise (prevents process.exit truncation) and no timer
  resurrection after close.

700 unit tests pass, ESLint clean.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
Clarify that the JWT string in the redactor test is intentionally fake
and is built from parts so the assembled token never appears as a
source literal (to satisfy pre-commit secret scanners).

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

CI's TypeScript was stricter than the local version and rejected the
untyped `c.args[1]` passed to `RegExp.test()`. Wrap in `String(...)`
so the tests compile on Node 14/16/18/20 runners.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

Comment thread lib/telemetry/CircuitBreaker.ts Outdated

// In HALF_OPEN state, allow only one probe request at a time
if (this.state === CircuitBreakerState.HALF_OPEN && this.halfOpenInflight > 0) {
throw new Error('Circuit breaker OPEN');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HALF_OPEN?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: HALF_OPEN now admits only one probe at a time via atomic tryAdmit(). Any failure there immediately reopens the circuit; successThreshold consecutive successes close it. New concurrent-probe test at CircuitBreaker.test.ts:~411 verifies the invariant.

return out;
}
const out: Record<string, string> = {};
for (const [k, v] of Object.entries(raw as Record<string, unknown>)) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we do type check here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added — normalizeHeaders now guards with typeof raw === 'object' before Object.entries, and filters out non-string values and non-string keys. Lives in telemetryUtils.ts and is shared between exporter and FFC (addresses your duplication comment below as well).

Comment thread lib/telemetry/ExceptionClassifier.ts Outdated
const errorCode = (error as any).code;
if (
errorCode === 'ECONNREFUSED' ||
errorCode === 'ENOTFOUND' ||
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notfound should not be transient

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about other code like,

  • ECONNRESET
  • ETIMEDOUT
  • EHOSTUNREACH
  • EAI_AGAIN

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Removed ENOTFOUND from retryable with a comment explaining why (DNS "not found" is deterministic — retrying just pushes load at the resolver). Kept ECONNRESET and EHOSTUNREACH; added ETIMEDOUT and EAI_AGAIN. Tests updated in ExceptionClassifier.test.ts (inverted ENOTFOUND expectation, added cases for the two new codes).

this.userAgent = buildUserAgentString(this.context.getConfig().userAgentEntry);
}

getOrCreateContext(host: string): FeatureFlagContext {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this thread safe?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may not apply to node, please double check

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed per your follow-up — Node is single-threaded, all Map accesses happen in sequential event-loop ticks. No concurrent mutation is possible. In-flight fetchPromises dedup handles the thundering-herd case where two refCount-bumps race with an expired cache.

return ctx;
}

releaseContext(host: string): void {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this thread safe

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above — single event loop, no concurrent access needed.


const isExpired = !ctx.lastFetched || Date.now() - ctx.lastFetched.getTime() > ctx.cacheDuration;

if (isExpired) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know this call is async, but can we make this call out of this loop, find some way that we can trigger the fetch automatically so that we don't see delay here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blocking fetch only happens once per host per 15 minutes (cache TTL), and concurrent callers are deduped via fetchPromises so only one in-flight request per host. First call after cache expiry does take the round-trip latency. Proactive refresh is feasible but needs a trigger signal from the consumer (currently none exists — the consumer-wiring lands in #327). Happy to add a background refresher in that PR once we know the call pattern.

Comment thread lib/telemetry/FeatureFlagCache.ts Outdated

logger.log(LogLevel.debug, `Fetching feature flags from ${endpoint}`);

const response = await this.sendRequest(endpoint, {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have retry on this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. fetchWithRetry now retries once on transient errors (ECONNRESET / ETIMEDOUT / EHOSTUNREACH / EAI_AGAIN / 5xx) via the shared ExceptionClassifier. One retry is intentional: without it, a single DNS hiccup disables telemetry for the full 15-minute TTL; more than one retry starts hammering a broken endpoint.

return fetch(url, { ...init, agent });
}

private async getAuthHeaders(): Promise<Record<string, string>> {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code seem duplicated

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted. normalizeHeaders and hasAuthorization moved to telemetryUtils.ts; both exporter and FFC now import them. ~40 lines of drift risk gone.

- ExceptionClassifier: remove ENOTFOUND from retryable (DNS "not found"
  is deterministic — retrying just pushes load at the resolver without
  any expectation of success). Add ETIMEDOUT and EAI_AGAIN per Jade's
  follow-up list.
- Extract shared `normalizeHeaders` + `hasAuthorization` helpers into
  telemetryUtils; DatabricksTelemetryExporter and FeatureFlagCache both
  use them — eliminates the ~40-line duplication Jade flagged.
- normalizeHeaders now guards `typeof raw === 'object'` before
  Object.entries, preventing Object.entries('some-string') index entries
  from leaking in as headers (Jade: "should we do type check here?").
- FeatureFlagCache.fetchFeatureFlag: add single-retry on transient errors
  (classified via ExceptionClassifier). Without a retry, one DNS hiccup
  would disable telemetry for the full 15-minute cache TTL; one retry
  gives an ephemeral failure a second chance without pushing sustained
  load at a broken endpoint.
- Drop the private hasAuthorization/normalizeHeaders on the exporter;
  drop the inlined branching in FFC getAuthHeaders.
- Update ExceptionClassifier tests: invert ENOTFOUND expectation with
  a comment explaining why; add cases for ETIMEDOUT and EAI_AGAIN.

702 unit tests pass, ESLint clean.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
@github-actions
Copy link
Copy Markdown

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

Comment thread lib/telemetry/DatabricksTelemetryExporter.ts
Comment thread lib/telemetry/FeatureFlagCache.ts
Comment thread lib/telemetry/MetricsAggregator.ts
@samikshya-db samikshya-db merged commit 46e2d6e into main Apr 21, 2026
7 of 8 checks passed
@samikshya-db samikshya-db deleted the telemetry-2-infrastructure branch April 21, 2026 05:26
@samikshya-db
Copy link
Copy Markdown
Collaborator Author

Hello @vikrantpuppala - saw your comments after the merge. addressed it here : https://github.com/databricks/databricks-sql-nodejs/pull/362/changes
Apologies for the churn !

samikshya-db added a commit that referenced this pull request Apr 22, 2026
* Add telemetry infrastructure: CircuitBreaker and FeatureFlagCache

This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier

* Add authentication support for REST API calls

Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix feature flag and telemetry export endpoints

- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling

* Match JDBC telemetry payload format

- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing

* Fix lint errors

- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils

* Add telemetry infrastructure: CircuitBreaker and FeatureFlagCache

This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier

* Add telemetry client management: TelemetryClient and Provider

This is part 3 of 7 in the telemetry implementation stack.

Components:
- TelemetryClient: HTTP client for telemetry export per host
- TelemetryClientProvider: Manages per-host client lifecycle with reference counting

TelemetryClient:
- Placeholder HTTP client for telemetry export
- Per-host isolation for connection pooling
- Lifecycle management (open/close)
- Ready for future HTTP implementation

TelemetryClientProvider:
- Reference counting tracks connections per host
- Automatically creates clients on first connection
- Closes and removes clients when refCount reaches zero
- Thread-safe per-host management

Design Pattern:
- Follows JDBC driver pattern for resource management
- One client per host, shared across connections
- Efficient resource utilization
- Clean lifecycle management

Testing:
- 31 comprehensive unit tests for TelemetryClient
- 31 comprehensive unit tests for TelemetryClientProvider
- 100% function coverage, >80% line/branch coverage
- Tests verify reference counting and lifecycle

Dependencies:
- Builds on [1/7] Types and [2/7] Infrastructure

* Add authentication support for REST API calls

Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix feature flag and telemetry export endpoints

- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling

* Match JDBC telemetry payload format

- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing

* Fix lint errors

- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils

* Add missing getAuthHeaders method to ClientContextStub

Fix TypeScript compilation error by implementing getAuthHeaders
method required by IClientContext interface.

* Add missing getAuthHeaders method to ClientContextStub

Fix TypeScript compilation error by implementing getAuthHeaders
method required by IClientContext interface.

* Fix prettier formatting

* Fix prettier formatting

* Add DRIVER_NAME constant for nodejs-sql-driver

* Add DRIVER_NAME constant for nodejs-sql-driver

* Add missing telemetry fields to match JDBC

Added osArch, runtimeVendor, localeName, charSetEncoding, and
processName fields to DriverConfiguration to match JDBC implementation.

* Add missing telemetry fields to match JDBC

Added osArch, runtimeVendor, localeName, charSetEncoding, and
processName fields to DriverConfiguration to match JDBC implementation.

* Fix TypeScript compilation: add missing fields to system_configuration interface

* Fix TypeScript compilation: add missing fields to system_configuration interface

* Fix telemetry PR review comments from #325

Three fixes addressing review feedback:

1. Fix documentation typo (sreekanth-db comment)
   - DatabricksTelemetryExporter.ts:94
   - Changed "TelemetryFrontendLog" to "DatabricksTelemetryLog"

2. Add proxy support (jadewang-db comment)
   - DatabricksTelemetryExporter.ts:exportInternal()
   - Get HTTP agent from connection provider
   - Pass agent to fetch for proxy support
   - Follows same pattern as CloudFetchResultHandler and DBSQLSession
   - Supports http/https/socks proxies with authentication

3. Fix flush timer to prevent rate limiting (sreekanth-db comment)
   - MetricsAggregator.ts:flush()
   - Reset timer after manual flushes (batch size, terminal errors)
   - Ensures consistent 30s spacing between exports
   - Prevents rapid successive flushes (e.g., batch at 25s, timer at 30s)

All changes follow existing driver patterns and maintain backward compatibility.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add proxy support to feature flag fetching

Feature flag fetching was also missing proxy support like telemetry
exporter was. Applied the same fix:

- Get HTTP agent from connection provider
- Pass agent to fetch call for proxy support
- Follows same pattern as CloudFetchResultHandler and DBSQLSession
- Supports http/https/socks proxies with authentication

This completes proxy support for all HTTP operations in the telemetry
system (both telemetry export and feature flag fetching).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Merge main into telemetry-3-client-management; prefer main versions for infrastructure

Takes main's versions of CircuitBreaker, FeatureFlagCache, DatabricksTelemetryExporter,
MetricsAggregator, TelemetryEventEmitter, and types — bringing in SSRF hardening,
overflow protection, async flush/close, errorStack redaction, and IAuthentication-based
auth headers. Removes IClientContext.getAuthHeaders() in favour of the direct
authProvider injection pattern from main.

Co-authored-by: samikshya-chand_data

* fix: prettier formatting and add coverage/ to .prettierignore

Co-authored-by: samikshya-chand_data

* fix: delete host entry before awaiting close to prevent stale-client race

If getOrCreateClient ran after refCount hit 0 but before clients.delete(host),
it would receive a closing TelemetryClient. Deleting synchronously first means
any concurrent caller gets a fresh instance instead.

Co-authored-by: samikshya-chand_data

* refactor: simplify TelemetryClient.close() and TelemetryClientProvider

- close() made synchronous (no I/O, no reason for async); uses finally to
  guarantee this.closed=true even when logger throws
- TelemetryClientProvider.releaseClient() made synchronous for the same reason
- Misleading ConcurrentHashMap reference removed from docstring
- getRefCount/getActiveClients marked @internal (test-only surface)
- Update tests to match: rejects→throws stubs, remove await on sync calls

Co-authored-by: samikshya-chand_data

* fix: address PR #326 review feedback from vikrant

- F5: Add refCount underflow guard and defensive warn-log in releaseClient
  so a double-release cannot drive the count negative. Also add a test that
  double-release does not throw or corrupt state.
- F6: Normalize host used as map key (lowercase, strip scheme, default port,
  trailing slash/dot, whitespace). Aliases like https://Example.COM/ now
  coalesce to a single entry; add a soft-limit warn at 128 entries to
  surface alias/leak bugs. Tests added for normalization + alias-release.
- F9: Replace `catch (error: any) { error.message }` with a narrowed
  `error instanceof Error ? error.message : String(error)` so non-Error
  throws (string, null) cannot escape the swallow-all contract. Test added.
- F10: Rewrite header and inline comments to describe the current sync
  close() semantics instead of an async-await model the code does not
  implement; note the invariant to re-establish when close() becomes async.
- F12: Relax exact debug-log string assertions to regex matchers so
  cosmetic log rewording does not break unrelated tests. Replace the
  `expect(true).to.be.true` tautology in TelemetryClient.test.ts with a
  behavioural `expect(() => client.close()).to.not.throw()`.

Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

* style: apply prettier formatting to TelemetryClientProvider.test.ts

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

---------

Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
@vikrantpuppala vikrantpuppala mentioned this pull request May 7, 2026
4 tasks
samikshya-db added a commit that referenced this pull request May 23, 2026
* Add telemetry infrastructure: CircuitBreaker and FeatureFlagCache

This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier

* Add authentication support for REST API calls

Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix feature flag and telemetry export endpoints

- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling

* Match JDBC telemetry payload format

- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing

* Fix lint errors

- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils

* Add telemetry infrastructure: CircuitBreaker and FeatureFlagCache

This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier

* Add telemetry client management: TelemetryClient and Provider

This is part 3 of 7 in the telemetry implementation stack.

Components:
- TelemetryClient: HTTP client for telemetry export per host
- TelemetryClientProvider: Manages per-host client lifecycle with reference counting

TelemetryClient:
- Placeholder HTTP client for telemetry export
- Per-host isolation for connection pooling
- Lifecycle management (open/close)
- Ready for future HTTP implementation

TelemetryClientProvider:
- Reference counting tracks connections per host
- Automatically creates clients on first connection
- Closes and removes clients when refCount reaches zero
- Thread-safe per-host management

Design Pattern:
- Follows JDBC driver pattern for resource management
- One client per host, shared across connections
- Efficient resource utilization
- Clean lifecycle management

Testing:
- 31 comprehensive unit tests for TelemetryClient
- 31 comprehensive unit tests for TelemetryClientProvider
- 100% function coverage, >80% line/branch coverage
- Tests verify reference counting and lifecycle

Dependencies:
- Builds on [1/7] Types and [2/7] Infrastructure

* Add authentication support for REST API calls

Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix feature flag and telemetry export endpoints

- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling

* Match JDBC telemetry payload format

- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing

* Fix lint errors

- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils

* Add telemetry infrastructure: CircuitBreaker and FeatureFlagCache

This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier

* Add telemetry client management: TelemetryClient and Provider

This is part 3 of 7 in the telemetry implementation stack.

Components:
- TelemetryClient: HTTP client for telemetry export per host
- TelemetryClientProvider: Manages per-host client lifecycle with reference counting

TelemetryClient:
- Placeholder HTTP client for telemetry export
- Per-host isolation for connection pooling
- Lifecycle management (open/close)
- Ready for future HTTP implementation

TelemetryClientProvider:
- Reference counting tracks connections per host
- Automatically creates clients on first connection
- Closes and removes clients when refCount reaches zero
- Thread-safe per-host management

Design Pattern:
- Follows JDBC driver pattern for resource management
- One client per host, shared across connections
- Efficient resource utilization
- Clean lifecycle management

Testing:
- 31 comprehensive unit tests for TelemetryClient
- 31 comprehensive unit tests for TelemetryClientProvider
- 100% function coverage, >80% line/branch coverage
- Tests verify reference counting and lifecycle

Dependencies:
- Builds on [1/7] Types and [2/7] Infrastructure

* Add telemetry event emission and aggregation

This is part 4 of 7 in the telemetry implementation stack.

Components:
- TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter
- MetricsAggregator: Per-statement aggregation with batch processing

TelemetryEventEmitter:
- Event-driven architecture using Node.js EventEmitter
- Type-safe event emission methods
- Respects telemetryEnabled configuration flag
- All exceptions swallowed and logged at debug level
- Zero impact when disabled

Event Types:
- connection.open: On successful connection
- statement.start: On statement execution
- statement.complete: On statement finish
- cloudfetch.chunk: On chunk download
- error: On exception with terminal classification

MetricsAggregator:
- Per-statement aggregation by statement_id
- Connection events emitted immediately (no aggregation)
- Statement events buffered until completeStatement() called
- Terminal exceptions flushed immediately
- Retryable exceptions buffered until statement complete
- Batch size (default 100) triggers flush
- Periodic timer (default 5s) triggers flush

Batching Strategy:
- Optimizes export efficiency
- Reduces HTTP overhead
- Smart flushing based on error criticality
- Memory efficient with bounded buffers

Testing:
- 31 comprehensive unit tests for TelemetryEventEmitter
- 32 comprehensive unit tests for MetricsAggregator
- 100% function coverage, >90% line/branch coverage
- Tests verify exception swallowing
- Tests verify debug-only logging

Dependencies:
- Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management

* Add authentication support for REST API calls

Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix feature flag and telemetry export endpoints

- Change feature flag endpoint to use NODEJS client type
- Fix telemetry endpoints to /telemetry-ext and /telemetry-unauth
- Update payload to match proto with system_configuration
- Add shared buildUrl utility for protocol handling

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Match JDBC telemetry payload format

- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing

* Fix lint errors

- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils

* Add telemetry infrastructure: CircuitBreaker and FeatureFlagCache

This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier

* Add telemetry client management: TelemetryClient and Provider

This is part 3 of 7 in the telemetry implementation stack.

Components:
- TelemetryClient: HTTP client for telemetry export per host
- TelemetryClientProvider: Manages per-host client lifecycle with reference counting

TelemetryClient:
- Placeholder HTTP client for telemetry export
- Per-host isolation for connection pooling
- Lifecycle management (open/close)
- Ready for future HTTP implementation

TelemetryClientProvider:
- Reference counting tracks connections per host
- Automatically creates clients on first connection
- Closes and removes clients when refCount reaches zero
- Thread-safe per-host management

Design Pattern:
- Follows JDBC driver pattern for resource management
- One client per host, shared across connections
- Efficient resource utilization
- Clean lifecycle management

Testing:
- 31 comprehensive unit tests for TelemetryClient
- 31 comprehensive unit tests for TelemetryClientProvider
- 100% function coverage, >80% line/branch coverage
- Tests verify reference counting and lifecycle

Dependencies:
- Builds on [1/7] Types and [2/7] Infrastructure

* Add telemetry event emission and aggregation

This is part 4 of 7 in the telemetry implementation stack.

Components:
- TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter
- MetricsAggregator: Per-statement aggregation with batch processing

TelemetryEventEmitter:
- Event-driven architecture using Node.js EventEmitter
- Type-safe event emission methods
- Respects telemetryEnabled configuration flag
- All exceptions swallowed and logged at debug level
- Zero impact when disabled

Event Types:
- connection.open: On successful connection
- statement.start: On statement execution
- statement.complete: On statement finish
- cloudfetch.chunk: On chunk download
- error: On exception with terminal classification

MetricsAggregator:
- Per-statement aggregation by statement_id
- Connection events emitted immediately (no aggregation)
- Statement events buffered until completeStatement() called
- Terminal exceptions flushed immediately
- Retryable exceptions buffered until statement complete
- Batch size (default 100) triggers flush
- Periodic timer (default 5s) triggers flush

Batching Strategy:
- Optimizes export efficiency
- Reduces HTTP overhead
- Smart flushing based on error criticality
- Memory efficient with bounded buffers

Testing:
- 31 comprehensive unit tests for TelemetryEventEmitter
- 32 comprehensive unit tests for MetricsAggregator
- 100% function coverage, >90% line/branch coverage
- Tests verify exception swallowing
- Tests verify debug-only logging

Dependencies:
- Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management

* Add telemetry export: DatabricksTelemetryExporter

This is part 5 of 7 in the telemetry implementation stack.

Components:
- DatabricksTelemetryExporter: HTTP export with retry logic and circuit breaker
- TelemetryExporterStub: Test stub for integration tests

DatabricksTelemetryExporter:
- Exports telemetry metrics to Databricks via HTTP POST
- Two endpoints: authenticated (/api/2.0/sql/telemetry-ext) and unauthenticated (/api/2.0/sql/telemetry-unauth)
- Integrates with CircuitBreaker for per-host endpoint protection
- Retry logic with exponential backoff and jitter
- Exception classification (terminal vs retryable)

Export Flow:
1. Check circuit breaker state (skip if OPEN)
2. Execute with circuit breaker protection
3. Retry on retryable errors with backoff
4. Circuit breaker tracks success/failure
5. All exceptions swallowed and logged at debug level

Retry Strategy:
- Max retries: 3 (default, configurable)
- Exponential backoff: 100ms * 2^attempt
- Jitter: Random 0-100ms to prevent thundering herd
- Terminal errors: No retry (401, 403, 404, 400)
- Retryable errors: Retry with backoff (429, 500, 502, 503, 504)

Circuit Breaker Integration:
- Success → Record success with circuit breaker
- Failure → Record failure with circuit breaker
- Circuit OPEN → Skip export, log at debug
- Automatic recovery via HALF_OPEN state

Critical Requirements:
- All exceptions swallowed (NEVER throws)
- All logging at LogLevel.debug ONLY
- No console logging
- Driver continues when telemetry fails

Testing:
- 24 comprehensive unit tests
- 96% statement coverage, 84% branch coverage
- Tests verify exception swallowing
- Tests verify retry logic
- Tests verify circuit breaker integration
- TelemetryExporterStub for integration tests

Dependencies:
- Builds on all previous layers [1/7] through [4/7]

* Add authentication support for REST API calls

Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Update DatabricksTelemetryExporter to use authenticated export

- Use getAuthHeaders() method for authenticated endpoint requests
- Remove TODO comments about missing authentication
- Add auth headers when telemetryAuthenticatedExport is true

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix feature flag endpoint and telemetry export

- Use NODEJS client type instead of OSS_NODEJS for feature flags
- Use /telemetry-ext and /telemetry-unauth (not /api/2.0/sql/...)
- Update payload to match proto: system_configuration with snake_case
- Add URL utility to handle protocol correctly

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Match JDBC telemetry payload format

- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing

* Fix lint errors

- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils

* Add telemetry infrastructure: CircuitBreaker and FeatureFlagCache

This is part 2 of 7 in the telemetry implementation stack.

Components:
- CircuitBreaker: Per-host endpoint protection with state management
- FeatureFlagCache: Per-host feature flag caching with reference counting
- CircuitBreakerRegistry: Manages circuit breakers per host

Circuit Breaker:
- States: CLOSED (normal), OPEN (failing), HALF_OPEN (testing recovery)
- Default: 5 failures trigger OPEN, 60s timeout, 2 successes to CLOSE
- Per-host isolation prevents cascade failures
- All state transitions logged at debug level

Feature Flag Cache:
- Per-host caching with 15-minute TTL
- Reference counting for connection lifecycle management
- Automatic cache expiration and refetch
- Context removed when refCount reaches zero

Testing:
- 32 comprehensive unit tests for CircuitBreaker
- 29 comprehensive unit tests for FeatureFlagCache
- 100% function coverage, >80% line/branch coverage
- CircuitBreakerStub for testing other components

Dependencies:
- Builds on [1/7] Types and Exception Classifier

* Add telemetry client management: TelemetryClient and Provider

This is part 3 of 7 in the telemetry implementation stack.

Components:
- TelemetryClient: HTTP client for telemetry export per host
- TelemetryClientProvider: Manages per-host client lifecycle with reference counting

TelemetryClient:
- Placeholder HTTP client for telemetry export
- Per-host isolation for connection pooling
- Lifecycle management (open/close)
- Ready for future HTTP implementation

TelemetryClientProvider:
- Reference counting tracks connections per host
- Automatically creates clients on first connection
- Closes and removes clients when refCount reaches zero
- Thread-safe per-host management

Design Pattern:
- Follows JDBC driver pattern for resource management
- One client per host, shared across connections
- Efficient resource utilization
- Clean lifecycle management

Testing:
- 31 comprehensive unit tests for TelemetryClient
- 31 comprehensive unit tests for TelemetryClientProvider
- 100% function coverage, >80% line/branch coverage
- Tests verify reference counting and lifecycle

Dependencies:
- Builds on [1/7] Types and [2/7] Infrastructure

* Add telemetry event emission and aggregation

This is part 4 of 7 in the telemetry implementation stack.

Components:
- TelemetryEventEmitter: Event-based telemetry emission using Node.js EventEmitter
- MetricsAggregator: Per-statement aggregation with batch processing

TelemetryEventEmitter:
- Event-driven architecture using Node.js EventEmitter
- Type-safe event emission methods
- Respects telemetryEnabled configuration flag
- All exceptions swallowed and logged at debug level
- Zero impact when disabled

Event Types:
- connection.open: On successful connection
- statement.start: On statement execution
- statement.complete: On statement finish
- cloudfetch.chunk: On chunk download
- error: On exception with terminal classification

MetricsAggregator:
- Per-statement aggregation by statement_id
- Connection events emitted immediately (no aggregation)
- Statement events buffered until completeStatement() called
- Terminal exceptions flushed immediately
- Retryable exceptions buffered until statement complete
- Batch size (default 100) triggers flush
- Periodic timer (default 5s) triggers flush

Batching Strategy:
- Optimizes export efficiency
- Reduces HTTP overhead
- Smart flushing based on error criticality
- Memory efficient with bounded buffers

Testing:
- 31 comprehensive unit tests for TelemetryEventEmitter
- 32 comprehensive unit tests for MetricsAggregator
- 100% function coverage, >90% line/branch coverage
- Tests verify exception swallowing
- Tests verify debug-only logging

Dependencies:
- Builds on [1/7] Types, [2/7] Infrastructure, [3/7] Client Management

* Add telemetry export: DatabricksTelemetryExporter

This is part 5 of 7 in the telemetry implementation stack.

Components:
- DatabricksTelemetryExporter: HTTP export with retry logic and circuit breaker
- TelemetryExporterStub: Test stub for integration tests

DatabricksTelemetryExporter:
- Exports telemetry metrics to Databricks via HTTP POST
- Two endpoints: authenticated (/api/2.0/sql/telemetry-ext) and unauthenticated (/api/2.0/sql/telemetry-unauth)
- Integrates with CircuitBreaker for per-host endpoint protection
- Retry logic with exponential backoff and jitter
- Exception classification (terminal vs retryable)

Export Flow:
1. Check circuit breaker state (skip if OPEN)
2. Execute with circuit breaker protection
3. Retry on retryable errors with backoff
4. Circuit breaker tracks success/failure
5. All exceptions swallowed and logged at debug level

Retry Strategy:
- Max retries: 3 (default, configurable)
- Exponential backoff: 100ms * 2^attempt
- Jitter: Random 0-100ms to prevent thundering herd
- Terminal errors: No retry (401, 403, 404, 400)
- Retryable errors: Retry with backoff (429, 500, 502, 503, 504)

Circuit Breaker Integration:
- Success → Record success with circuit breaker
- Failure → Record failure with circuit breaker
- Circuit OPEN → Skip export, log at debug
- Automatic recovery via HALF_OPEN state

Critical Requirements:
- All exceptions swallowed (NEVER throws)
- All logging at LogLevel.debug ONLY
- No console logging
- Driver continues when telemetry fails

Testing:
- 24 comprehensive unit tests
- 96% statement coverage, 84% branch coverage
- Tests verify exception swallowing
- Tests verify retry logic
- Tests verify circuit breaker integration
- TelemetryExporterStub for integration tests

Dependencies:
- Builds on all previous layers [1/7] through [4/7]

* Add telemetry integration into driver components

This is part 6 of 7 in the telemetry implementation stack.

Integration Points:
- DBSQLClient: Telemetry lifecycle management and configuration
- DBSQLOperation: Statement event emissions
- DBSQLSession: Session ID propagation
- CloudFetchResultHandler: Chunk download events
- IDBSQLClient: ConnectionOptions override support

DBSQLClient Integration:
- initializeTelemetry(): Initialize all telemetry components
- Feature flag check via FeatureFlagCache
- Create TelemetryClientProvider, EventEmitter, MetricsAggregator, Exporter
- Wire event listeners between emitter and aggregator
- Cleanup on close(): Flush metrics, release clients, release feature flag context
- Override support via ConnectionOptions.telemetryEnabled

Event Emission Points:
- connection.open: On successful openSession() with driver config
- statement.start: In DBSQLOperation constructor
- statement.complete: In DBSQLOperation.close()
- cloudfetch.chunk: In CloudFetchResultHandler.downloadLink()
- error: In DBSQLOperation.emitErrorEvent() with terminal classification

Session ID Propagation:
- DBSQLSession passes sessionId to DBSQLOperation constructor
- All events include sessionId for correlation
- Statement events include both sessionId and statementId

Error Handling:
- All telemetry code wrapped in try-catch
- All exceptions logged at LogLevel.debug ONLY
- Driver NEVER throws due to telemetry failures
- Zero impact on driver operations

Configuration Override:
- ConnectionOptions.telemetryEnabled overrides config
- Per-connection control for testing
- Respects feature flag when override not specified

Testing:
- Integration test suite: 11 comprehensive E2E tests
- Tests verify full telemetry flow: connection → statement → export
- Tests verify feature flag behavior
- Tests verify driver works when telemetry fails
- Tests verify no exceptions propagate

Dependencies:
- Builds on all previous layers [1/7] through [5/7]
- Completes the telemetry data flow pipeline

* Add authentication support for REST API calls

Implements getAuthHeaders() method for authenticated REST API requests:
- Added getAuthHeaders() to IClientContext interface
- Implemented in DBSQLClient using authProvider.authenticate()
- Updated FeatureFlagCache to fetch from connector-service API with auth
- Added driver version support for version-specific feature flags
- Replaced placeholder implementation with actual REST API calls

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Update DatabricksTelemetryExporter to use authenticated export

- Use getAuthHeaders() method for authenticated endpoint requests
- Remove TODO comments about missing authentication
- Add auth headers when telemetryAuthenticatedExport is true

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix telemetry event listeners and add config options

- Fix event listener names: Remove 'telemetry.' prefix
- Add support for telemetryBatchSize and telemetryAuthenticatedExport config options
- Update telemetry files with fixed endpoints and proto format

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Match JDBC telemetry payload format

- Change payload structure to match JDBC: uploadTime, items, protoLogs
- protoLogs contains JSON-stringified TelemetryFrontendLog objects
- Remove workspace_id (JDBC doesn't populate it)
- Remove debug logs added during testing

* Fix lint errors

- Fix import order in FeatureFlagCache
- Replace require() with import for driverVersion
- Fix variable shadowing
- Disable prefer-default-export for urlUtils

* Add missing getAuthHeaders method to ClientContextStub

Fix TypeScript compilation error by implementing getAuthHeaders
method required by IClientContext interface.

* Add missing getAuthHeaders method to ClientContextStub

Fix TypeScript compilation error by implementing getAuthHeaders
method required by IClientContext interface.

* Add missing getAuthHeaders method to ClientContextStub

Fix TypeScript compilation error by implementing getAuthHeaders
method required by IClientContext interface.

* Add missing getAuthHeaders method to ClientContextStub

Fix TypeScript compilation error by implementing getAuthHeaders
method required by IClientContext interface.

* Add missing getAuthHeaders method to ClientContextStub

Fix TypeScript compilation error by implementing getAuthHeaders
method required by IClientContext interface.

* Fix prettier formatting

* Fix prettier formatting

* Fix prettier formatting

* Fix prettier formatting

* Fix prettier formatting

* Use nodejs-sql-driver as driver name in telemetry

Changed from '@databricks/sql' to 'nodejs-sql-driver' to match
JDBC driver naming convention. Added DRIVER_NAME constant to types.ts.

* Add DRIVER_NAME constant for nodejs-sql-driver

* Add DRIVER_NAME constant for nodejs-sql-driver

* Add DRIVER_NAME constant for nodejs-sql-driver

* Add DRIVER_NAME constant for nodejs-sql-driver

* Add missing telemetry fields to match JDBC

Added osArch, runtimeVendor, localeName, charSetEncoding, and
processName fields to DriverConfiguration to match JDBC implementation.

* Add missing telemetry fields to match JDBC

Added osArch, runtimeVendor, localeName, charSetEncoding, and
processName fields to DriverConfiguration to match JDBC implementation.

* Populate all telemetry system configuration fields

Added helper methods to populate osArch, runtimeVendor, localeName,
charSetEncoding, and processName to match JDBC implementation:
- osArch: from os.arch()
- runtimeVendor: 'Node.js Foundation'
- localeName: from LANG env var (format: en_US)
- charSetEncoding: UTF-8 (Node.js default)
- processName: from process.title or script name

* Add missing telemetry fields to match JDBC

Added osArch, runtimeVendor, localeName, charSetEncoding, and
processName fields to DriverConfiguration to match JDBC implementation.

* Add missing telemetry fields to match JDBC

Added osArch, runtimeVendor, localeName, charSetEncoding, and
processName fields to DriverConfiguration to match JDBC implementation.

* Add missing telemetry fields to match JDBC

Added osArch, runtimeVendor, localeName, charSetEncoding, and
processName fields to DriverConfiguration to match JDBC implementation.

* Fix telemetry aggregator cleanup in client close

Changed from flush() to close() to ensure:
- Periodic flush timer is stopped
- Incomplete statements are finalized
- Final flush is performed

Previously, only flush() was called which left the timer running
and didn't complete remaining statements.

* Fix TypeScript compilation: add missing fields to system_configuration interface

* Fix TypeScript compilation: add missing fields to system_configuration interface

* Fix TypeScript compilation: add missing fields to system_configuration interface

* Fix TypeScript compilation: add missing fields to system_configuration interface

* Add telemetry testing and documentation

This is part 7 of 7 in the telemetry implementation stack - FINAL LAYER.

Documentation:
- README.md: Add telemetry overview section
- docs/TELEMETRY.md: Comprehensive telemetry documentation
- spec/telemetry-design.md: Detailed design document
- spec/telemetry-sprint-plan.md: Implementation plan
- spec/telemetry-test-completion-summary.md: Test coverage report

README.md Updates:
- Added telemetry overview section
- Configuration examples with all 7 options
- Privacy-first design highlights
- Link to detailed TELEMETRY.md

TELEMETRY.md - Complete User Guide:
- Overview and benefits
- Privacy-first design (what is/isn't collected)
- Configuration guide with examples
- Event types with JSON schemas
- Feature control (server-side flag + client override)
- Architecture overview
- Troubleshooting guide
- Privacy & compliance (GDPR, CCPA, SOC 2)
- Performance impact analysis
- FAQ (12 common questions)

Design Document (telemetry-design.md):
- Complete system architecture
- Component specifications
- Data flow diagrams
- Error handling requirements
- Testing strategy
- Implementation phases

Test Coverage Summary:
- 226 telemetry tests passing
- 97.76% line coverage
- 90.59% branch coverage
- 100% function coverage
- Critical requirements verified

Test Breakdown by Component:
- ExceptionClassifier: 51 tests (100% coverage)
- CircuitBreaker: 32 tests (100% functions)
- FeatureFlagCache: 29 tests (100% functions)
- TelemetryEventEmitter: 31 tests (100% functions)
- TelemetryClient: 31 tests (100% functions)
- TelemetryClientProvider: 31 tests (100% functions)
- MetricsAggregator: 32 tests (94% lines, 82% branches)
- DatabricksTelemetryExporter: 24 tests (96% statements)
- Integration: 11 E2E tests

Critical Test Verification:
✅ All exceptions swallowed (no propagation)
✅ Debug-only logging (no warn/error)
✅ No console logging
✅ Driver works when telemetry fails
✅ Reference counting correct
✅ Circuit breaker behavior correct

This completes the 7-layer telemetry implementation stack!

Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

* Add authentication support for REST API calls in telemetry

Implement proper authentication for feature flag fetching and telemetry
export by adding getAuthHeaders() method to IClientContext.

- **IClientContext**: Add getAuthHeaders() method to expose auth headers
- **DBSQLClient**: Implement getAuthHeaders() using authProvider.authenticate()
- Returns empty object gracefully if no auth provider available

- **FeatureFlagCache**: Implement actual server API call
- Endpoint: GET /api/2.0/connector-service/feature-flags/OSS_NODEJS/{version}
- Uses context.getAuthHeaders() for authentication
- Parses JSON response with flags array
- Updates cache duration from server-provided ttl_seconds
- Looks for: databricks.partnerplatform.clientConfigsFeatureFlags.enableTelemetryForNodeJs
- All exceptions swallowed with debug logging only

- **DatabricksTelemetryExporter**: Add authentication to authenticated endpoint
- Uses context.getAuthHeaders() when authenticatedExport=true
- Properly authenticates POST to /api/2.0/sql/telemetry-ext
- Removes TODO comments about missing authentication

Follows same pattern as JDBC driver:
- Endpoint: /api/2.0/connector-service/feature-flags/OSS_JDBC/{version} (JDBC)
- Endpoint: /api/2.0/connector-service/feature-flags/OSS_NODEJS/{version} (Node.js)
- Auth headers from connection's authenticate() method
- Response format: { flags: [{ name, value }], ttl_seconds }

- Build: ✅ Successful
- E2E: ✅ Verified with real credentials
- Feature flag fetch now fully functional
- Telemetry export now properly authenticated

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

* Fix telemetry and feature flag implementation

- Fix event listener names: use 'connection.open' not 'telemetry.connection.open'
- Fix feature flag endpoint: use NODEJS client type instead of OSS_NODEJS
- Fix telemetry endpoints: use /telemetry-ext and /telemetry-unauth (not /api/2.0/sql/...)
- Update telemetry payload to match proto: use system_configuration with snake_case fields
- Add URL utility to handle hosts with or without protocol
- Add telemetryBatchSize and telemetryAuthenticatedExport config options
- Remove debug statements and temporary feature flag override

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

* Fix prettier formatting

Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

* Update telemetry design doc with system config and protoLogs format

Added detailed documentation for:
- System configuration fields (osArch, runtimeVendor, localeName,
  charSetEncoding, processName) with JDBC equivalents
- protoLogs payload format matching JDBC TelemetryRequest structure
- Complete log object structure with all field descriptions
- Example JSON payloads showing actual format sent to server

Clarified that:
- Each log is JSON-stringified before adding to protoLogs array
- Connection events include full system_configuration
- Statement events include operation_latency_ms and sql_operation
- The items field is required but always empty

Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

* Document telemetry export lifecycle and timing

Added comprehensive section 6.5 explaining exactly when telemetry
exports occur:

- Statement close: Aggregates metrics, exports only if batch full
- Connection close: ALWAYS exports all pending metrics via aggregator.close()
- Process exit: NO automatic export unless close() was called
- Batch size/timer: Automatic background exports

Included:
- Code examples showing actual implementation
- Summary table comparing all lifecycle events
- Best practices for ensuring telemetry export (SIGINT/SIGTERM handlers)
- Key differences from JDBC (JVM shutdown hooks vs manual close)

Clarified that aggregator.close() does three things:
1. Stops the periodic flush timer
2. Completes any remaining incomplete statements
3. Performs final flush to export all buffered metrics

Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

* Add connection open latency tracking and enable telemetry by default

Changes:
- Track and export connection open latency (session creation time)
- Enable telemetry by default (was false), gated by feature flag
- Update design doc to document connection latency

Implementation:
- DBSQLClient.openSession(): Track start time and calculate latency
- TelemetryEventEmitter: Accept latencyMs in connection event
- MetricsAggregator: Include latency in connection metrics
- DatabricksTelemetryExporter: Export operation_latency_ms for connections

Config changes:
- telemetryEnabled: true by default (in DBSQLClient and types.ts)
- Feature flag check still gates initialization for safe rollout

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

* Populate sql_operation, statement_id, and auth_type in telemetry

Fixes:
- sql_operation now properly populated by fetching metadata before statement close
- statement_id always populated from operation handle GUID
- auth_type now included in driver_connection_params

Changes:
- DBSQLOperation: Fetch metadata before emitting statement.complete to ensure
  resultFormat is available for sql_operation field
- DBSQLClient: Track authType from connection options and include in
  driver configuration
- DatabricksTelemetryExporter: Export auth_type in driver_connection_params
- types.ts: Add authType to DriverConfiguration interface
- Design doc: Document auth_type, resultFormat population, and connection params

Implementation details:
- emitStatementComplete() is now async to await metadata fetch
- Auth type defaults to 'access-token' if not specified
- Result format fetched even if not explicitly requested by user
- Handles metadata fetch failures gracefully (continues without resultFormat)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

* Map auth type to telemetry auth enum

- Convert 'access-token' (or undefined) to 'pat'
- Convert 'databricks-oauth' to 'external-browser' (U2M) or 'oauth-m2m' (M2M)
- Distinguish M2M from U2M by checking for oauthClientSecret
- Keep 'custom' as 'custom'

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add SqlExecutionEvent fields to telemetry

- Add statement_type field from operationType
- Add is_compressed field from compression tracking
- Export both fields in sql_operation for statement metrics
- Fields populated from CloudFetch chunk events

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Filter out NIL UUID from statement ID in telemetry

- Exclude '00000000-0000-0000-0000-000000000000' from sql_statement_id
- Only include valid statement IDs in telemetry logs

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Only populate sql_operation fields when present

- statement_type only included if operationType is set
- is_compressed only included if compressed value is set
- execution_result only included if resultFormat is set
- sql_operation object only created if any field is present

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Map Thrift operation type to proto Operation.Type enum

- Convert TOperationType (Thrift) to proto Operation.Type names
- EXECUTE_STATEMENT remains EXECUTE_STATEMENT
- GET_TYPE_INFO -> LIST_TYPE_INFO
- GET_CATALOGS -> LIST_CATALOGS
- GET_SCHEMAS -> LIST_SCHEMAS
- GET_TABLES -> LIST_TABLES
- GET_TABLE_TYPES -> LIST_TABLE_TYPES
- GET_COLUMNS -> LIST_COLUMNS
- GET_FUNCTIONS -> LIST_FUNCTIONS
- UNKNOWN -> TYPE_UNSPECIFIED

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Move auth_type to top level per proto definition

- auth_type is field 5 at OssSqlDriverTelemetryLog level, not nested
- Remove driver_connection_params (not populated in Node.js driver)
- Export auth_type directly in sql_driver_log for connection events

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Map result format to proto ExecutionResult.Format enum

- ARROW_BASED_SET -> INLINE_ARROW
- COLUMN_BASED_SET -> COLUMNAR_INLINE
- ROW_BASED_SET -> INLINE_JSON
- URL_BASED_SET -> EXTERNAL_LINKS

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Refactor telemetry type mappers to separate file

- Create lib/telemetry/telemetryTypeMappers.ts
- Move mapOperationTypeToTelemetryType (renamed from mapOperationTypeToProto)
- Move mapResultFormatToTelemetryType (renamed from mapResultFormatToProto)
- Keep all telemetry-specific mapping functions in one place

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add driver_connection_params with available fields

- http_path: API endpoint path
- socket_timeout: Connection timeout in milliseconds
- enable_arrow: Whether Arrow format is enabled
- enable_direct_results: Whether direct results are enabled
- enable_metric_view_metadata: Whether metric view metadata is enabled
- Only populate fields that are present

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Document proto field coverage in design doc

- Add section 14 detailing implemented and missing proto fields
- List all fields from OssSqlDriverTelemetryLog that are implemented
- Document which fields are not implemented and why
- Explain that missing fields require additional instrumentation

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Include system_configuration, driver_connection_params, and auth_type in all telemetry logs

- Cache driver config in MetricsAggregator when connection event is processed
- Include cached driver config in all statement and error metrics
- Export system_configuration, driver_connection_params, and auth_type for every log
- Each telemetry log is now self-contained with full context

This ensures every telemetry event (connection, statement, error) includes
the driver configuration context, making logs independently analyzable.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add connection close telemetry event

Implement CONNECTION_CLOSE telemetry event to track session lifecycle:
- Add CONNECTION_CLOSE event type to TelemetryEventType enum
- Add emitConnectionClose() method to TelemetryEventEmitter
- Add processConnectionCloseEvent() handler in MetricsAggregator
- Track session open time in DBSQLSession and emit close event with latency
- Remove unused TOperationType import from DBSQLOperation

This provides complete session telemetry: connection open, statement execution,
and connection close with latencies for each operation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix unit tests for connection close telemetry

Update test files to match new telemetry interface changes:
- Add latencyMs parameter to all emitConnectionOpen() test calls
- Add missing DriverConfiguration fields in test mocks (osArch,
  runtimeVendor, localeName, charSetEncoding, authType, processName)

This fixes TypeScript compilation errors introduced by the connection
close telemetry implementation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add connection.close event listener to telemetry wire-up

Fix missing event listener for CONNECTION_CLOSE events in DBSQLClient
telemetry initialization. Without this listener, connection close events
were being emitted but not routed to the aggregator for processing.

Now all 3 telemetry events are properly exported:
- CONNECTION_OPEN (connection latency)
- STATEMENT_COMPLETE (execution latency)
- CONNECTION_CLOSE (session duration)

Verified with e2e test showing 3 successful telemetry exports.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Make telemetry logging silent by default

Remove verbose telemetry logs to minimize noise in customer logs.
Only log essential startup/shutdown messages and errors:

Kept (LogLevel.debug):
- "Telemetry: enabled" - on successful initialization
- "Telemetry: disabled" - when feature flag disables it
- "Telemetry: closed" - on graceful shutdown
- Error messages only when failures occur

Removed:
- Individual metric flushing logs
- Export operation logs ("Exporting N metrics")
- Success confirmations ("Successfully exported")
- Client lifecycle logs (creation, ref counting)
- All intermediate operational logs

Updated spec/telemetry-design.md to document the silent logging policy.

Telemetry still functions correctly - exports happen silently in the
background without cluttering customer logs.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Ensure statement_type always populated in telemetry

Fix issue where statement_type was null in telemetry payloads.

Changes:
- mapOperationTypeToTelemetryType() now always returns a string,
  defaulting to 'TYPE_UNSPECIFIED' when operationType is undefined
- statement_type always included in sql_operation telemetry log

This ensures that even if the Thrift operationHandle doesn't have
operationType set, the telemetry will include 'TYPE_UNSPECIFIED'
instead of null.

Root cause: operationHandle.operationType from Thrift response can
be undefined, resulting in null statement_type in telemetry logs.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add operation types to connection metrics

Connection metrics now include operation type in sql_operation:
- CREATE_SESSION for connection open events
- DELETE_SESSION for connection close events

This matches the proto Operation.Type enum which includes session-level
operations in addition to statement-level operations.

Before:
  sql_operation: null

After:
  sql_operation: {
    statement_type: "CREATE_SESSION"  // or "DELETE_SESSION"
  }

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix telemetry proto field mapping

Correct issue where Operation.Type values were incorrectly placed in
statement_type field. Per proto definition:

- statement_type expects Statement.Type (QUERY, SQL, UPDATE, METADATA, VOLUME)
- operation_type goes in operation_detail.operation_type and uses Operation.Type

Changes:
- Connection metrics: Set sql_operation.operation_detail.operation_type to
  CREATE_SESSION or DELETE_SESSION
- Statement metrics: Set both statement_type (QUERY or METADATA based on
  operation) and operation_detail.operation_type (EXECUTE_STATEMENT, etc.)
- Added mapOperationToStatementType() to convert Operation.Type to Statement.Type

This ensures telemetry payloads match the OssSqlDriverTelemetryLog proto
structure correctly.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add operation_detail field to telemetry interface and enhance test

- Added operation_detail field to DatabricksTelemetryLog interface
- Enhanced telemetry-local.test.ts to capture and display actual payloads
- Verified all three telemetry events (CONNECTION_OPEN, STATEMENT_COMPLETE, CONNECTION_CLOSE)
- Confirmed statement_type and operation_detail.operation_type are properly populated

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add error scenario test for telemetry validation

- Added test for invalid query execution (TABLE_OR_VIEW_NOT_FOUND)
- Confirms SQL execution errors are handled as failed statements
- Verified telemetry payloads still correctly formatted during errors
- Note: Driver-level errors (connection/timeout) would need emitErrorEvent wiring

Test output shows correct behavior:
- CONNECTION_OPEN with CREATE_SESSION
- STATEMENT_COMPLETE with QUERY + EXECUTE_STATEMENT (even on error)
- CONNECTION_CLOSE with DELETE_SESSION

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix telemetry PR review comments from #325

Three fixes addressing review feedback:

1. Fix documentation typo (sreekanth-db comment)
   - DatabricksTelemetryExporter.ts:94
   - Changed "TelemetryFrontendLog" to "DatabricksTelemetryLog"

2. Add proxy support (jadewang-db comment)
   - DatabricksTelemetryExporter.ts:exportInternal()
   - Get HTTP agent from connection provider
   - Pass agent to fetch for proxy support
   - Follows same pattern as CloudFetchResultHandler and DBSQLSession
   - Supports http/https/socks proxies with authentication

3. Fix flush timer to prevent rate limiting (sreekanth-db comment)
   - MetricsAggregator.ts:flush()
   - Reset timer after manual flushes (batch size, terminal errors)
   - Ensures consistent 30s spacing between exports
   - Prevents rapid successive flushes (e.g., batch at 25s, timer at 30s)

All changes follow existing driver patterns and maintain backward compatibility.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix telemetry PR review comments from #325

Three fixes addressing review feedback:

1. Fix documentation typo (sreekanth-db comment)
   - DatabricksTelemetryExporter.ts:94
   - Changed "TelemetryFrontendLog" to "DatabricksTelemetryLog"

2. Add proxy support (jadewang-db comment)
   - DatabricksTelemetryExporter.ts:exportInternal()
   - Get HTTP agent from connection provider
   - Pass agent to fetch for proxy support
   - Follows same pattern as CloudFetchResultHandler and DBSQLSession
   - Supports http/https/socks proxies with authentication

3. Fix flush timer to prevent rate limiting (sreekanth-db comment)
   - MetricsAggregator.ts:flush()
   - Reset timer after manual flushes (batch size, terminal errors)
   - Ensures consistent 30s spacing between exports
   - Prevents rapid successive flushes (e.g., batch at 25s, timer at 30s)

All changes follow existing driver patterns and maintain backward compatibility.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add proxy support to feature flag fetching

Feature flag fetching was also missing proxy support like telemetry
exporter was. Applied the same fix:

- Get HTTP agent from connection provider
- Pass agent to fetch call for proxy support
- Follows same pattern as CloudFetchResultHandler and DBSQLSession
- Supports http/https/socks proxies with authentication

This completes proxy support for all HTTP operations in the telemetry
system (both telemetry export and feature flag fetching).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add proxy support to feature flag fetching

Feature flag fetching was also missing proxy support like telemetry
exporter was. Applied the same fix:

- Get HTTP agent from connection provider
- Pass agent to fetch call for proxy support
- Follows same pattern as CloudFetchResultHandler and DBSQLSession
- Supports http/https/socks proxies with authentication

This completes proxy support for all HTTP operations in the telemetry
system (both telemetry export and feature flag fetching).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Make feature flag cache extensible for multiple flags

Refactored FeatureFlagCache to support querying any feature flag,
not just the telemetry flag:

**Changes:**
- Store all flags from server in Map<string, string>
- Add generic isFeatureEnabled(host, flagName) method
- Keep isTelemetryEnabled() as convenience method
- fetchFeatureFlags() now stores all flags for future use

**Benefits:**
- Extensible to any safe feature flag
- No code changes needed to add new flags
- Single fetch stores all flags from response
- Backward compatible (isTelemetryEnabled still works)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add circuit breaker protection to feature flag fetching

Feature flags now use the same circuit breaker protection as telemetry
for resilience against endpoint failures.

**Changes:**
- FeatureFlagCache now accepts optional CircuitBreakerRegistry
- Feature flag fetches wrapped in circuit breaker execution
- Shared circuit breaker registry between feature flags and telemetry
- Per-host circuit breaker isolation maintained
- Falls back to cached values when circuit is OPEN

**Benefits:**
- Protects against repeated failures to feature flag endpoint
- Fails fast when endpoint is down (circuit OPEN)
- Auto-recovery after timeout (60s default)
- Same resilience patterns as telemetry export

**Configuration:**
- Failure threshold: 5 consecutive failures
- Timeout: 60 seconds
- Per-host isolation (failures on one host don't affect others)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix telemetry unit tests for extensible feature flags and proto payload format

- Update FeatureFlagCache tests to use new extensible flags Map
- Fix DatabricksTelemetryExporter tests to use protoLogs format
- Verify telemetry endpoints use correct paths (/telemetry-ext, /telemetry-unauth)
- 213 passing, 13 logging assertion tests need investigation

* Merge main into telemetry-3-client-management; prefer main versions for infrastructure

Takes main's versions of CircuitBreaker, FeatureFlagCache, DatabricksTelemetryExporter,
MetricsAggregator, TelemetryEventEmitter, and types — bringing in SSRF hardening,
overflow protection, async flush/close, errorStack redaction, and IAuthentication-based
auth headers. Removes IClientContext.getAuthHeaders() in favour of the direct
authProvider injection pattern from main.

Co-authored-by: samikshya-chand_data

* fix: prettier formatting and add coverage/ to .prettierignore

Co-authored-by: samikshya-chand_data

* fix: delete host entry before awaiting close to prevent stale-client race

If getOrCreateClient ran after refCount hit 0 but before clients.delete(host),
it would receive a closing TelemetryClient. Deleting synchronously first means
any concurrent caller gets a fresh instance instead.

Co-authored-by: samikshya-chand_data

* refactor: simplify TelemetryClient.close() and TelemetryClientProvider

- close() made synchronous (no I/O, no reason for async); uses finally to
  guarantee this.closed=true even when logger throws
- TelemetryClientProvider.releaseClient() made synchronous for the same reason
- Misleading ConcurrentHashMap reference removed from docstring
- getRefCount/getActiveClients marked @internal (test-only surface)
- Update tests to match: rejects→throws stubs, remove await on sync calls

Co-authored-by: samikshya-chand_data

* chore: move telemetry MD docs to follow-up PR

Remove docs/TELEMETRY.md, spec/telemetry-design.md,
spec/telemetry-sprint-plan.md, spec/telemetry-test-completion-summary.md
and revert README.md — these ~4.9k lines of markdown are being
split into a stacked docs-only PR on top of this branch to keep the
[4/7] diff focused on code.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

* chore: drop tests/e2e/telemetry-local.test.ts

Local-only e2e harness that duplicates what tests/e2e/telemetry/telemetry-integration.test.ts covers in CI.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

* chore: remove unused CircuitBreaker import in FeatureFlagCache

Co-authored-by: Isaac

* chore: apply prettier formatting to telemetry files

Fixes the lint CI job which runs prettier --check before eslint.

Co-authored-by: Isaac
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>

* feat: emit per-chunk timing telemetry on the FetchResults path

Aggregate initial/slowest/sum chunk latencies in MetricsAggregator,
emit them in the proto chunk_details block, and time each FetchResults
RPC in RowSetProvider so the inline-Arrow path populates chunk
telemetry alongside CloudFetch (mirrors Go's chunkTimingAccumulator).

Also fix MetricsAggregator clobbering accumulated chunkCount and
bytesDownloaded to 0 on STATEMENT_COMPLETE when the event omitted
those fields — this hid chunk_details from every path.

Co-authored-by: Isaac

* fix(telemetry): address review feedback on aggregator and close()

- Snapshot driverConfig on each statement at first event so a later
  CONNECTION_OPEN can't retroactively rewrite the config reported by
  in-flight statements (and their buffered errors).
- Attach a defensive .catch() to the fire-and-forget exporter.export()
  call so any future regression that leaks a rejection logs at debug
  rather than surfacing as an unhandled promise rejection.
- Document the unref()'d flush timer on DBSQLClient.close(): callers
  must await close() on shutdown to drain buffered telemetry; otherwise
  metrics between flush ticks are lost.

Co-authored-by: Isaac

* fix(telemetry): iter-2 review fixes — rebase regressions, type-safe wiring, error telemetry

Rebase the regressed exporter/aggregator/feature-flag-cache on main's
hardened versions and re-apply only the genuinely new functionality
(CONNECTION_CLOSE event, chunk-timing aggregation) on top. Closes the
critical findings from the multi-reviewer audit:

  - SSRF guard, redactSensitive, sanitizeProcessName, hasAuthorization,
    auth-missing warn-once — all restored via main's telemetryUtils.
  - MetricsAggregator memory bounds (maxPendingMetrics with error-preferring
    drop, maxErrorsPerStatement, statementTtlMs eviction) restored.
  - FeatureFlagCache in-flight fetch dedup and TTL clamp [60s, 3600s]
    restored; lib/telemetry/urlUtils.ts deleted.
  - close() now properly awaits aggregator drain — fixes the close()/flush
    race that PR #362 already fixed once.
  - Driver version reads lib/version.ts via buildUserAgentString instead
    of hardcoded '1.0.0'; uuidv4() restored in place of Math.random().
  - TelemetryTerminalError re-exported from lib/index.ts.

Type-safe wiring:

  - Added optional getTelemetryEmitter() / getTelemetryAggregator() to
    IClientContext; removed all 7 `(this.context as any)` casts.
  - Six copy-pasted event listeners in DBSQLClient.initializeTelemetry
    collapsed into one `Object.values(TelemetryEventType)` loop — closes
    the listener-name mismatch that silently dropped error events.
  - mapAuthType now covers all 6 authType values instead of defaulting
    everything to 'pat'.

TelemetryClient now owns the host-scoped resources:

  - TelemetryClientProvider is a process-wide singleton (getInstance()).
  - TelemetryClient owns DatabricksTelemetryExporter, MetricsAggregator,
    CircuitBreakerRegistry, and FeatureFlagCache for its host. Implements
    IClientContext itself so the owned components have a stable context
    that survives any single DBSQLClient's close.
  - DBSQLClient instances on the same host share the breaker counters,
    feature-flag cache, exporter, and HTTP batches. Fixes the per-instance
    breaker-fragmentation noted in iter-2 architecture review.
  - Each DBSQLClient still holds its own TelemetryEventEmitter (respects
    per-client telemetryEnabled); emitters bridge into the shared aggregator.
  - Exporter falls back to context.getAuthProvider() when no explicit auth
    provider is passed, so the shared exporter resolves auth through the
    TelemetryClient's FIFO of registered DBSQLClients.

Error telemetry wired across operation entry points:

  - Re-added emitErrorEvent(error) on DBSQLOperation; uses
    ExceptionClassifier.isTerminal() to classify.
  - fetchChunk, cancel, close, getMetadata wrap their bodies in try/catch
    that calls emitErrorEvent before re-throwing. Verified end-to-end
    against a real Azure Databricks workspace: failed query produces
    STATEMENT_COMPLETE + ERROR (with redacted stack) on the wire.
  - Removed the await getMetadata() call from emitStatementComplete —
    eliminates the extra Thrift RPC on every close (F19) AND prevents
    spurious error telemetry from getMetadata's wrapper firing during
    close-cleanup of an already-failed operation.

Other:

  - Iterating Map.keys() while mutating made safe via snapshot in close().
  - STATEMENT_COMPLETE no longer zeroes accumulated chunk metrics when
    the emit doesn't supply them (matches sibling-field guards).
  - Tests for the rebased modules restored from main; provider tests
    updated for the singleton API; deleted unused TelemetryExporterStub.

484 unit tests passing. Diff vs main: ~+2110/-383, down from the
original PR's +3640/-1173.

Co-authored-by: Isaac

* chore(telemetry): satisfy CI lint and prettier

- Apply prettier to TelemetryClientProvider.test.ts (sed-edits in the
  prior commit didn't preserve formatting).
- Silence eslint `no-await-in-loop` on the auth-context fall-through
  in TelemetryClient.getConnectionProvider — sequential by intent.
- Drop the empty public constructor on TelemetryClientProvider; leave
  a comment explaining the singleton + test-friendly construction
  contract.

Co-authored-by: Isaac

* chore(lint): disable func-names in test files

Mocha tests need `function () {}` so they can use `this.timeout()` /
`this.skip()` — arrow functions don't bind `this` to the test context.
The `func-names` rule was firing on every test in the suite (including
pre-existing tests in `protocol_versions.test.ts`); moving the rule to
the test-file override block silences those warnings.

Co-authored-by: Isaac

* fix(telemetry): address review findings — wiring, lifecycle, redaction, knobs

Iter-3 review fixes addressing 17 distinct findings from the multi-agent
review. Telemetry is now functionally correct and operationally safe.

Critical
- F1: TelemetryClient ctor wires getOrCreateContext on FeatureFlagCache.
  isTelemetryEnabled was previously short-circuiting to false in production
  because no caller registered the host — every customer silently emitted
  zero events.
- F2: integration test asserts the documented default (true), not the prior
  off-by-default. Test was contradicting production code.
- F3: IClientContext.getAuthProvider now optional; consumers use ?.() so
  external implementers don't break on upgrade.

High / privacy
- F4: explicit DATABRICKS_TELEMETRY_DISABLED parser (1/true/yes/on, case
  insensitive). Avoids the footgun where DATABRICKS_TELEMETRY_DISABLED=false
  also disabled telemetry. Documented in CHANGELOG and TSDoc.
- F12: TelemetryClient.registerContext warns on telemetry-config and
  userAgentEntry divergence so multi-tenant misconfig is visible.
- F9: connect()-on-reconnect releases prior refcount; close() clears the
  emitter ref so post-close events can't smuggle into a closed aggregator.
- M-1: redactSensitive strips /home/<user>/, /Users/<user>/, and
  C:\Users\<user>\ patterns from stack traces.
- M-3: FeatureFlagCache.getAuthHeaders falls through to the context's auth
  provider — feature-flag GET is no longer unconditionally unauth.

Operational
- F7: MetricsAggregator.close races the final flush against a configurable
  telemetryCloseTimeoutMs (default 2s) so a flapping endpoint can't hang
  process.exit(0).
- F8: flushInFlight serializer prevents concurrent fire-and-forget flushes
  from starving the user's HTTP socket pool. Drain pattern in close()
  awaits any in-flight flush, then issues a fresh one to capture
  close-time metrics that would otherwise be stranded.
- F16: maxStatementMetrics cap (default 5000) with oldest-first eviction.
  Buffered errors emitted as standalone metrics first so the first-failure
  signal survives.
- DBSQLSession.close() emits connection.close even when closeSession
  fails so failed-close rates are visible in dashboards.

Maintainability
- F10/F17: single withErrorTelemetry helper covers fetchChunk, cancel,
  close, finished, hasMoreRows, getSchema, getMetadata. safeEmit helper
  consolidates seven copy-pasted "get emitter, emit, swallow at debug"
  blocks across DBSQLOperation, DBSQLClient, DBSQLSession,
  CloudFetchResultHandler, RowSetProvider. Also fixes the inconsistency
  where DBSQLSession.close() lacked the swallow wrapper that the other
  six sites had.

API surface
- F13: ConnectionOptions exposes nine telemetry knobs (was three) with
  TSDoc. Adds telemetryFlushIntervalMs, telemetryMaxRetries,
  telemetryCircuitBreakerThreshold, telemetryCircuitBreakerTimeout,
  telemetryCloseTimeoutMs, telemetryMaxStatementMetrics.

Tests
- ClientContextStub gains telemetryEmitter / telemetryAggregator hooks so
  unit tests can assert on emit calls instead of silently no-op'ing.
- 18 new unit tests covering F1 refcount, F12 divergence warn, async-close
  idempotency, error-telemetry wrappers (cancel, close, getMetadata,
  closed-op finished/getSchema/hasMoreRows), multi-context FIFO, and a
  new tests/unit/result/RowSetProvider.test.ts file (RowSetProvider had
  no test file at all). 783 unit tests pass; live e2e against
  adb-27363120558779.19.azuredatabricks.net validates the full pipeline.

Co-authored-by: Isaac

* chore(lint): apply prettier to MetricsAggregator

Co-authored-by: Isaac

* fix(telemetry): address review highs — refcount leak, FIFO fallthrough, public types, README, test gaps

- DBSQLClient.initializeTelemetry: release the per-host refcount on the
  catch path so a thr…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants